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ABSTRACT 


The development of large integrated data bases that support a ‘variety of applications in an 
enterprise promises to be one of the most important data processing activities of the next 
decade. The: effective utilization of such data bases depends off the ability of data base 
management systems to cope with the evolution of data base applications. In this thesis, we 
attempt to develop a methodology for monitoring the developing: pattern of access to a data 
base and for choosing near-optimal physical data base organizations based on the evidenced 
mode of use. More specifically, we consider the problem ‘of adaptively’setecting the set of 
secondary indices to be maintained in an integrated relational data base. Stress is placed on 
the acquisition of- an accurate usage model and:on the precise’ estimation of data base 
characteristics, through the use of: access monitoring and the application of forecasting and 
smoothing techniques. The cost model used to evaluate propésed:index sets is realistic and 
flexible enough to incorporate the overhead costs of index maintenance, creation, and 
storage. A heuristic algorithm is developed for the selection of a nearoptinal index set 
without an exhaustive enumeration of all Rees: ea 
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CHAPTER 1 
INTRODUCTION 


The development of large integrated data bases, each serving a wide variety of applications, 
promises to be one of the most important data processing activities of the next decade. The 
effective utilization of such data bases is highly dependent on the relationship between their 
physical organization and the prevailing modes of use to which they are put. In this thesis, 
we address the problem of optimizing the perforniance of an integrated data base by 


automatically adapting its physical organization to changing access requirements. 


1. Integrated Data bases 


An integrated data base may be defined as a collection of interrelated data stored without 
harmful or unnecessary redundancy and accessed in a uniform and controlled manner, 
serving one or more applications in an optimal fashion [Martin?5]. It may be viewed as the 
respository of information needed for performing certain functions in an enterprise. In 
addition to accesses (continuous retrievals and updates) by application programs for regular 
control functions, it may be used by interactive users for unanticipated information retrieval 


for planning purposes. 


The profits to be gained from the integration of ‘previously related but highly duplicated 
data bases are manifold [Martin?5, Chamberlin?é]. The elimination of unnecessary 
redundancy leads to reduced storage and updating costs. More importantly, the consistency 
of information stored in the data base is enhanced, since the possibility of having different 


copies of the data in different stages of updating is removed. Furthermore, the improved 
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coherence of the data will significantly increase the usability of the data base. By providing 
users with the capability of extracting any information that is logically contained in the data 
base, the generation of extensive printed reports on a scheduled base for manual analysis 


can often be avoided. 


In order for these data bases to be truly effective, the data management systems which 
support them will have to manifest two important characteristics: data independence and 
non-procedural access. By data independence we mean that users and their application 
progtaiis are shielded from knowledge of the actual physical organizations used to 
represent their data, concentrating on : logical view of the data. This makes the data base 
easy to use and avoids the need for application programs to change when the data base’s 
physical structure is reorganized. Non-procedural pene also wakes the data base easy to 
use; this means the provision of access languages which allow the specification of desired 
data in terms of properties it possesses rather than in terms of the search algorithm used to 


locate it in the data base. 


2. Relational Data Model 


The relational model [Codd70] of data has been proposed as a means of achieving the 
above goals of data-independence and non-procedural access. The user of a relational data 
base is provided with a simple and uniform view of the data, a logical view which is. 
completely independent of the actual storage structures, used to. represent their data. The 
simplicity of this logical data eideure lends itself to access by, cneane of ‘easy-to-use 
languages, which provide associative referencing (content addressing) of the data base 


contents. 
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Specifically, a relational data base consists of a collection of relations - a relation is a named 
two-dimensional table, which has a fixed number of (named¥ columns and an arbitrary 
number of (unnamed) rows (called tuples). Each tuple (ty tors tn) represents an entry in 
the relation; ¢,, the ith component of a tuple, is a member of Dj, the domain associated with 
the ith column. (Henceforth, we will use the terms column and domains interchangably.) 
The relation EMP depicted in Figure | has four columns; for each tuple of the relation, the 
corresponding columnar values represent the name, age, sex, and salary of the particular 
employee. Figure | represents a snapshot of the relation at a particular point in time; 
relational data base languages provide users the ability to selectively retrieve or modify 


individual tuples, as well as insert and delete tuples. 


EMP: TUPLE NAME AGE SEX SALARY 


1 Smith 30 M 16000 
2 James 25 M 12000 . 
3 Black 28 F 14000 
4. Brown 35 M 20000 
5 Jones 20 F 10000 
6 White. 40 F 16000 
7 Gray 35 M 15000 
8 Green 20. F 10000 
Figure 1 


A Sample Relation 


The table of Figure 1 is purely the user's logical:view.of the data base; there are no 


stipulations as to how this data would actually be stored on the computer. 
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In order to find the names of all male employees making more than $15,000, the user might 
express a query [Astrahan75) as 


SELECT NAME 

FROM EMP 

WHERE SEX = 'M' 

AND SAL > 15000. 
The query language processor would translate this specification of the desired information 
into searches on the data base that utilize the precise storage structures and auxiliary access 
mechanisms used to store the data in order to ‘locate the desised tuples and retrieve the 


specified column values. 


3. Relational Data Base Implementation 


Because of the distance of the user’s view of a relational data base (and of his queries 
against it) from the realities. of the data base’s physical. organization, more responsibility is 
placed on a relational data base system than on a conventional system. This responsibility 
takes two forms: choosing the physical representation for a relation; and optimizing the 
execution of queries against a relation, making optimally efficient use of the available access 
Structures. Relational data base systems must possess “intelligence” in order to make 
decisions in these areas, which have heretofore been the province of human decision- 


makers. 


We believe that the selection of good storage structures is the primary issue in relational 
data base implementation, since the efficiency that can be achieved. by a query optimizer is 
Strictly delimited by the available storage structures. Furthermore;the efficient utilization of 
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a data base is highly dependent on the optimal matching of its physical organization to its 
access requirements, as well as to other of its characteristics (such as the distribution of 
values in it). (For example, certain data base orgenizations are suitable for low update - 
high retrieval situations, while others yield optimal: performance in’ opposite circumstances.) 
Hence, the usage pattern of a data base should be ascertained and utilized in choosing its 
physical organization. In addition, when viewed as the repository of all information used in 
managing an enterprise, an integrated data base can no longer be considered as a static 
entity. Instead, it is continually changing in size, and {ts access requirements gradually alter 
as applications evolve and users develop familiarity with the system. Accordingly, the 
tuning of a data base’s physical organization to fit its usage pattern must be an ongoing 
process. 


In current relational data base systems, the data base administrator (DBA) may make 
recommendations to the system about desirable: auxiliary’access structures, but his 
judgements are based largely on intuition and on a timited amount of communication with 
some individual users. For large integrated data bases, a more systematic means for 
acquiring information about data base usage, and a more algorithmic way of evaluating the 
costs of alternative configurations, will be essential: A minimat capability of a data base 
management system should be the incorporation of monitoring mechanisms that colfect 
usage statistics while performing query processing. A more sophisticated system would sense 
a change in access requirements, evaluate the cost/benefits of various reorganization 
strategies, and choose an optimal structure to be recommended tothe DBA; eventually, such 


a system might itself perform the necessary tuning. 


The work to be reported in this thesis is part of an. ongoing research effort to develop a 


self-adaptive data. base management system. The intent of this. development: is twofold: to 
develop the techniques and methodology for the construction, of such systems, and to do 
performance analysis of these techniques so as.to assess their casts and payoffs. 


The operation of the initial version of the prototype system is envisioned as follows. The 
Specifications of data base interactions, by both interactive users and application 
programmers, will be expressed in a non-procedural language; these are first translated into 
an a high level procedural system level interface language, which is then interpreted by. the 
system modules. The language processor has available to it a model of the current state of 
the data base, which contains, among other things, a description of the current physical 
organization of the data base, and estimations of the characteristics of the data base’s 
current contents. Using this information, the language processor can choose the best 
Strategy for processing each data base operation in the current environment. Statistics 
gathering mechanisms are embedded within the system. modules that interpret the ob ject 
code of the language processor, and record data concerning the execution of every data base 
transaction. The statistical information gathered for a run is deposited in a collection area 
and summarized from time to time. When the reorganization camponent of the system is 
invoked (which will be at fixed intervals of time), the statistical information collected over 
the preceding interval is combined with statistics from previous interval and used to obtain 
a forecast of the access requirements for the upcoming interval in addition, a projected 
assessment of various characteristics of the data in the data base is made. A near-optimal 


physical organization for the data base is then determined heuristically; optimality means 
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with respect to total cost for the upcoming interval, taking into account the storage and 
maintenance cost of any auxiliary access structures. This-cost is compared with the 
projected cost for the existing organization. Reorganization will be performed only if its 
payoff is great enough to cover an appropriate fraction of its cost as well as that of 


e% 


application program retranstation. 


5. Thesis Objectives 


The principal goal of this thesis is to devetop the techniques and methodology-for the 
construction of self-adaptive data base management systems: Af‘its ‘heart, this is a problem 
in pattern recognition, statistical forecasting and artficial intelligence: first, to extract from 
a mass of statistics relating to data base performance ’a succinct pattern which ‘characterizes 
its mode of use; second, to apply forecasting techniques developed in management science in 
the detection of shifts in usage pattern and the projection of upcoming access requirements; 
third, since an exhaustive consideration of all possible structures is computationally 
infeasible, to develop efficient heuristics that can use the projected usage pattern to 


synthesize a near-optimal structure. 


The continuous monitoring of accesses to a data base opens up many possibilities for its 
reorganization. Rather than providing a comprehensive study on reorganization 
possibilities in a data base management system, we have limited the’ scope of our initial 
investigation to a well-defined aspect of data base reorganization, so as to obtain some 
concrete results. We have chosen as the vehicle for this study the problem of index 
selection in a relational data base. A secondary’index {sometimes referred to as an 


inversion) is a well-known software structure which-can improve the performance of 
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accesses to a relation (file) [Bleir67, Date75, Martin?5). For each domain (field) of the 
relation that is indexed, a table is maintained, which for each value of the domain in 
question contains pointers to all those tuples (records) whose contents in the designated 
: domain is the specified value. Clearly, the presence of a secondary index for a particular 
domain can improve the execution of many queries that reference that domain; on the 
other hand, manitenance of such an index has costs that slow down the performance of data 
base updates, insertions, and deletions. Roughly speaking, a domain that is referenced. 
frequently relative to its modification is a good candidate for index maintenance. The 
choice of which (if any) domains ta index must be done with. care; a good choice can 
significantly improve the performance of the system, while a bad selection can seriously 
degrade it. The goal of our system is ta make a good choice of those domains for which to 
maintain secondary indices, based on how the data base is actually used. 


6. Approach 


There have been a number of previous studies on the. index selection problem [Lum/7l, 
King74, Stonebraker74, Cardenas75, Held7?5b, Farley?5, Scholnick75). However, we feel that 
the results that have been obtained are not directly applicable to a complete or general data 
base environment. Some of these have been formal analyses which have made many 
simplifying assumptions in order to obtain analytic solutions; others have been system 
designs that are incomplete or unrealistic in many ways. Our thrust here is to relax many 
of the simplifying assumptions made in previous studies and to develop more complete and 
accurate models of costs and accesses. In addition, we will stress the importance of the 
acquistion of accurate parameters to the cost model, an area which is of special significance 


in a dynamic environment where access requirements are continually changing, but which 
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has hardly been addressed in previous works. Four basic components of our investigations 
can be identified: 


(1) the development of accurate cost models for the processing: of dita base transactions 
under different indexing organization (i.e. when different sets of domains are 


indexed), 


(2) the identification of the set of usage parameters that succinctly characterize the data 
base usage and which can be inexpensively acquired during the processing of data 


base transactions; 


(3) the application of appropriate forecasting techniques to detect and respond to shifts in 


access patterns and data characteristics; 


(4) the design of heuristic computation procedures that exploit the structure of the index 
selection problem in the synthesis of a near-optimal data base organization (i.e. 


choosing a near optimal index set) at a reasonable cost. 


7. Organization 


The rest of this thesis is organized as follows. Chapter Two summarizes the data base 
environment which we shall utilize: the data model, the transaction model, the storage and 
index organizations, and the various assumptions we have made. In Chapter Three, we 
present our cost analysis for various basic operations in the data base and describe the 


ob jective cost function that we will attempt to minimize. Then in Chapter Four, we explain 


Chapter | 16 Introduction 


how parameters needed by our cost model are acquired through statistics gathering and 
application of forecasting techniques. In chapter Five, we argue for the need of heuristics 
for the solution of the index selection problem and describe the heuristics we have devised. 


Finally, Chapter Six includes summary, conclusions and suggestions for future research. 
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CHAPTER 2 
- DATA BASE ORGANIZATION 


In this chapter, we describe the data base environment we have assumed in our research. 
Our discussions will be based on a rather’ generalmodel of the data base, one which can 
readily be extended to characterize a large variety of existing systems. We will describe the 
storage and access structures in the data base, the kinds of transactions that may be 
conducted against it, and the way transactions are procetsed. In addition, we will contrast 
our assumptions with those employed in previous studies which wé feel to be incomplete or 


unrealistic. 


1. File Model 


As we have said, we operate in the environment of a relational data base. The totality of 
formatted data in the data base consists therefore of one or more: relations. However, we 
address here the reduced problem of selecting ‘indices for a data base made of a single 
relation. (We expect that extensions can be made to the general multi-relation case.) Even 
though insertion and deletion of tuples are permitted in our transaction model, we will 
assume that the cardinality (number of tuples) of the relation remains relatively‘unchanged 
between two consecutive points at which index selection is considered (i.e. the rate of 


change in size of the data base per review interval is small). 


2. Access Model. » 


ie 


Previous studies on index selection have often ‘assumed rather unrealistic access models: 
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both King [King74] and Schkolnick [Schotnick?5] have assumed that the cost of accessing 
an arbitrary subset of all the tuples in a relation is directly proportional to the size of the 
set. This will be true only if all tuples are equally accessible, as in the case when they all 
reside in primary memory, or equally inaccessible, as in.the case when each is independently 
Stored on secondary storage. For data bases of reasonable size and reasonable tuple length, 
neither assumption will hold. 


In this study, we will assume that the totality of the data base (both the stored representation 
of the relation and the set. of secondary indices that are maintained) resides on direct access 
secondary storage devices like disks [Rothnie74, Blasgen76}. Physical storage space on such 
devices is partitioned into fixed size blocks called pages. The page is the unit of memory 
allocation and the unit of transfer between main memory and secondary storage. The 
accessing cost of a page is assumed to be independent. of the sequence of page accesses. 
Furthermore, we will assume that the system is I/O. bound, so.that page accessing cost 
dominates all other internal processing costs. Hence, the processing cost for a data base 
transaction is measured solely in terms of the number of pages that have to be accessed in 


its processing. 


3. Tuple Organization 


We will assume that tuples are of fixed length (i.e. each occupies the same amount of 
physical storage space) so that each page has a capacity for a fixed number of tuples. To 
retrieve all the tuples in the relation, a scan of all the pages on which the tuples reside can 
be performed. (Henceforth, we will refer to these pages as the segment on which the 


relation is stored.) The cost of this sequential scan is just.p pages, where p is the number of 
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pages in the segment. However, in many instances, only a small subset of the tuples will 
actually be required for processing; hence, it is desirable to provide additional access paths 
to enable access to just those tuples that are needed. In other situations, all tuples may be 
required, but in a specific sort order. If the required ordering is different from the one in 
which tuples are physically stored, then sorting will be required; for typical sizes of data 
bases, an external sort would’ be in order, and would entail going over the data in several 
passes. Hence, it is desirable to physically cluster together tuples that are needed together. 
Held and Stonebraker [Held75b] have investigated a variety of organizations for storing 
tuples of a relation on pages of a direct access file, and have made a broad categorization of 
keyed structures versus non-keyed structures. A key structure is one in-which a domain (or a 
combination of domains) is used to determine where in secondary storage the tuple should 
be stored. The advantage of a keyed structure is that tuples that are often needed together 
can be clustered together physically. However, any modification to a tuple in the keyed 
domain(s) will require the tuple to be relocated. Hence, all index entries that point to this 
tuple will have to be modified also. A non-keyed structure is one in which the tuples are 
stored using some criteria that is independent of the value of the tuple. The advantage of 
a non-keyed structure is that it enables auxiliary access structures like indices to be 


maintained more economically. 


For the purpose of this thesis, we will assume that the tuples in the relation are organized as 
a non-keyed structure. We will assume that they are stored sequentially on the pages of the 
segment without any preferred ordering. (For example, they might be stored according to 
their chronological order of insertion into the data base.) Since the cost of a sequential scan 
is dependent on the number of pages in the segment, it is essential that the storage 


utilization in the tuple space be maximized, so as to minimize the cost of segment scans. We 
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will assume that all empty spaces resulting from the deletion of tuples will be reused for 
newly inserted tuples, before a new page is allocated for the segment. (This can readily be 
done by keeping a linked list of the empty spaces in the segment. The linked list can be 
Stored in the empty spaces in the segment itself. Only a pointer to the head of the list need 
to be maintained separately for the purpose of storage allecation in the tuple space.) Even. 
with the above assumption, poor storage utilization.can still.resuit. from a long. sequence. of 
tuple insertions followed by a long sequence:of tuple deletions. On the other hand, garbage 
collection in the tuple space would have to be accompanied by the: modification of all those 
index entries for tuples that.are relocated. To simplify..our discussions. here, we will finesse 
the need to garbage.collect in the tuple space by assuming that there are no clustered 
deletions of tuples from the same page, and that the general trend .is for the data- base to 
grow in size. (Note that we could readily include garbage-collection overhead in our cost 
model by monitoring the average number of tuples. that are relocated per review interval, in 
addition to the actual number of insertions, deletions and modifications, towards the 


estimation of index maintenance cost.) 


4. Index Organization. 


We assume that each tuple in the relation has associated with it a unique tuple identifier 
(TID), a logical address which enables the tuple to be lecated. with a single page access. An 
index on a column of a relation is then a mapping from values in the column ta TIDs of. 
tuples with those values. Conceptually, an index may be viewed as a binary relation 
consisting of pairs whose first component is.a value from the column and whose second 
component is the TID of a tuple with that value. Figure 2 shows.an index on the column 
salary for the EMP relation depicted in Figure 1. (This. sequential. organization is actually 
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assumed in [King74].) However, as Cardenas [Cardenas75] has pointed out, the 
organization of the index is itself an important problem in the enhancement of system 
‘performance. We will therefore assume that the index is organized in such a way that all 
those TIDs associated with the same column value are easily accessible. Specifically, we will 
adopt the VSAM-like tree organization as used in Blasgen’s study (Wagner73, Blasgen76]. 
Figure 3 shows how the index shown in Figure 2 will actually be stored. It is a balanced 
tree whose nodes are index pages. Leaf pages contain pairs wHose first component is a 
column value and whose second component is a sorted list of the TIDs of those tuples with 
that column value. The pairs in each leaf-page are sorted on the value of their first 
component. Higher level pages contains pairs consisting of the identifier of a lower page 
and the high key value on it. These pairs are also sorted by the values of their first 
components. The tree is kept balanced on insertion or deletion in a way that is similar to 


the maintenance of B-trees [Bayer72], with the splitting and merging of pages as necessary. 
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Figure 3 


Physical Organization of an Index 
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5S. Transaction Model and Processing © 


We will consider four types of transactions that may be conducted against the data base: 
the retrieval, insertion, updating and deletion of tuples. An update or a delete operation is 
often specified in two components: a selection component which determines the set of tuples 
to be processed, and an action component (which in the case of an update, determines how 
each tuple is to be processed). As will be discussed below, the use of an index (or indices) to 
identify the set of tuples that satisty (or potentially satisf y a selection component entails a 
number of steps, after which we can no longer assume that any part of an index still reside 
in primary memory. Hence, we can assume that the maintenance to the indices due to an 
update or delete is independent of the selection component of the transaction (i.e., the 
maintenance cost of an index due to a tuple deletion or ‘modifi ication is the same regardless 
of whether the upre is identified thfough the use e of that index’ or through a sequential 
scan). Therefore, we will assume that the data base transactions specitied in the source 
language get translated by the language processor into sequences of queries updates, inserts, 


and deletes, as described below. 


(1) Query - this can result either from a retrieval pect ication in the source language or 
from the data selection component of an update or delete specifi ication as discussed 
above. It enables those tuples to be pecrier ed: or acted upon to be specif ied in terms of 
the properties they possess. In relational access languages that are currently being 
developed, powerful and general data selection predicates are allowed [Codd71, 
Boyce74, Astrahan75, Held75a, Crarnik?8) However, in order to be able to evaluate 
the tradeoffs of a particular index, we shall nent ourselves to the consideration of only 


those data selection predicates for whose processing the utility of indices can readily be 
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(a) 


(b) 


determined. We will therefore allow only the following predicate types: 


a predicate consisting of a single equality condition or a conjunction of two or more 


equality conditions; 
a predicate consisting of a disjunction of two or more equality conditions. 


(By an equality condition, we mean a predicate of the form A=k, where A is some 


domain name, and k is a constant or program variable.) 


Henceforth, we will refer to the process of identifying the set of tuples that satisfy 
(qualify for) the selection predicate associated with a query as the resolving of the © 
query. (A retrieval specification in the source language may in addition to the selection 
of tuples, specify what fields in the selected tuples are to be output or further 
processed. However, the time to perform these operations is. independent of which 
indices are maintained, and so will be ignored in our disucssion here.) For a query 
arising from a delete or update specification in the ssiirce language, we will assume 
that each qualified tuple is returned accompanied by its TID, thus allowing it to be 
identified in subsequent delete or update operations. | 


An index (or a set of indices) can be used to totally or partially resolve a query. (A 
query is said to be totally resolved when the exact set of tuples” that Satisfy the 
associated selection predicate is identified, and it is said to be partially resolved if a 
superset (but one which is smaller than the entire set of tuples in the relation) of those 


tuples that satisfy the associated selection predicate is identif ied. This partially 
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(a) 


qualified set of tuples will have to be brought into main memory and each tuple must 
be examined to determine if it satisfies the full predicate.) Given a data selection 
predicate and an existing set of indices, there are a number of possible strategies for 
obtaining the set of selected tuples. Depending on the nature of the predicate and the 
selectivities of the domains involved, it may be most profitable to use none, all, or a 
subset of the applicable indices. In previous studies, it is assumed that indices are used 
whenever they are available. However, as we will see from our cost analysis in the 
next chapter, there may be situations in which it would be most economical to use less 
than the full set of applicable indices in resolving a query. For simplicity, we will 
assume that the query processor uses the following decision<criterion: it will evaluate 
the expected cost of processing the query using the full set of applicable indices (i.e. 
existing indices on those columns which are specified in the query) and will use all of 


them only if this expected cost is less that of a sequentiat scan; otherwise a sequential 


‘scan will be utilized. (Note that a disjunctive query will be resolved using indices only 


if indices on all of the domains referenced in the query are available; a tuple that 
does not satisfy any of the predicates resolved through indices can still potentially 
Satisfy those predicates on domains for which no indices exist, and hence some 


qualified tuples can only be identified through a sequential scan of the entire segment.) 


We will assume that a query is processed using indices as follows: 


For each domain specified in the query and for which an index exists, a list containing 


the TIDs of all those tuples that satisfy the equality condition on the column in 


question is obtained; 
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(b) Depending on whether the selection predicate is a conjunction or disjunction, an 


(c) 


(2) 


intersection or union listof all those lists obtained .in step (a) is formed. This restricted 
list contains the TIDs of all thase tuples that satisfy the conjunction or: disjunction of 
those equality predicates involving domains. for which indices. exist; 


This restricted set of tuples is brought to main memory for further processing. (In the 
case of a canjunctive query that has only been partially resolved, ie, the restricted set 
of tuples only satisfy the conjunction of those equality predicates. involving domains 
which are indexed, each of the restricted. tuples is checked against the equality 
conditions involving the non-indexed domains.and then discarded or retained 
accordingly. (This is sometimes known as the removal_of false-drops (Schkolnick75]) 


(In the process of obtaining the TID list for. the restricted tuples that have to be 
accessed, it is possible that some of the TID lists involved are se long that. they cannot 
completely reside in primary memory. Therefore, we will. assume that the list 
manipulation phase is combined with the tuple access. phase:. ie, we will assume that 
the individual TID lists are in the same sort order, so that. the union or intersection 
process can be carried. out in a single pass over ail of. the participating lists [Weich76]. 
By utilizing portions of the resulting TID list as soon as it is available, extra page 
accessing can be avoided.) 


Insert - this inserts a single tuple inte the relation. It is specified by supplying a value 
for each of fields.in the tuple to. be inserted, and. results in the tuple’s insertion into the 
main file, together with the necessary maintenance to the existing. set of indices. 
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(3) Delete - this deletes a single tuple from the relation. It is specified by supplying the 
TID of the tuple to be deleted, together with values in different fields of the tuple, 
and results in the deletion of the tuple from the stored representation of the relation, 


and the necessary maintenance to the existing set of indices to reflect this deletion. 


(4) Update - this involves a single tuple in the relation. It is specified by three 
components: the TID of the tuple to be updated, its old component values, and its new 
component values. It causes the tuple to be updated, and the indices on the affected 


domains modified accordingly. 


6. Query Distribution 


In earlier index selection studies, simplifying assumptions on query distributions are often 
adopted. In [King74], it is assumed that selection predicates only consist of of single equality 
conditions. Hence, it is sufficient to summarize the statistics on query distribution by the 
probabilities of each domain being specified in a selection predicate. In [Scholnick75], the 
restriction to the consideration of single-domain queries is relaxed, but with the imposition 
of the new assumption that domain occurrence probabilities in queries are independent. 
Hence, the model that is used there is unable to account for the positive or negative 
correlation among domain occurrences in queries; such correlation is common in the usage 
of real data bases. (For example, in queries on the EMP relation (Figure 1), age and salary 
might often be specified together while name is more likely to be specified alone.) In this 
study, we will do away with any such simplifying assumption by observing the occurrence 


frequencies of those queries that actually occur. 
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7. Domain Value Distribution 


In earlier work, it is often assumed that the set of distinct values in a domain is evenly 
distributed among tuples in the relation, and that all domain values are equally likely to be 
specified in the constant part of a selection predicate. Consequently, the average fraction of 
tuples that can be expected to satisfy an equality condition on a domain is assumed to be the 
reciprocal of the number of values in the domain. However, in a real data base, it is often 
the case that the distribution of domain values among tuples and in query specifications are 
skewed; i.e. some values are used more often than others. We would like to take advantage 
of our continuous monitoring facility to detect such situations. We will therefore define a 
new measure for the resolving power of a domain index. We define the average selectivity: 
of a domain as the average fraction of tuples under consideration that have historically 


Satisfied an equality condition involving that demain. 


Since we allow the specification of multiple domains in queries, it is necessary to have a 
measure for the joint resolving power of two or more domain indices. For this purpose, we 
will assume that the specification of values from different domains in a query are 
uncorrelated (i.e, given that a query specifies two columns A and B, the probability of a 
particular key value in column B being specified is independent of the value in column A 
that is specified). Hence, the joint conjunctive selectivity of a set of domains D, each with 


average selectivity AS; is 
(2. 1) Diep ASy 


(The interpretation of this expression is that the expected fraction of tuples that satisfy a 
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number of predicates simultaneously is equal to the product of the individual expected 
fractions that satisfy the predicates.) Similarly, the joint disjunctive selectivity of a set of 


domains D, each with average selectivity AS, is 


(2. 2) 1- yep (1 - ASy) 


(The interpretation of this is that the expected fraction of tuples that satisfy a disjunction 
of equality conditions is the complement of the fraction expected not to satisfy any of the 


equality conditions in the disjunction.) 


8. Objective of Index Selection 


We assume that index selection will be reconsidered at fixed intervals and that usage 
Statistics are collected during the processing of each transaction in the data base and 
summarized at the end of each interval. The objective of the selection process is to 
minimize the total system cost for the upcoming interval. This total cost includes retrieval 
processing; index creation, maintenance and storage; and application program 
retranslation. In contrast with previous studies, we have chosen to minimize this total cost, 
rather than using a probabilistic model of data base transactions and attempting to 
minimize only the expected cost of an average transaction. Our information on the absolute 
level of activities in the data base (in addition to their relative levels) allows us to amortize 
such cost as index creation, index storage and application program retranslation over the 


data base transactions, rather than completely omitting them from the cost model. 
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CHAPTER 3 
COST ANALYSIS 


One of the most important tasks in the analysis and enhancement of performance of any 
system is the determination of the set of parameters that have a significant effect on 
performance and the formulation of the cost model relating system performance to these 
parameters. Since we operate in a.dynamic environment, we have to resort to the continuous 
monitoring of data base transactions to obtain the parameters in our system. As we shall 
see, most of these parameters can be directly measured. However, there are others that are 
not directly observable, in which case we have to relate them to statistics that can be readily 
obtained through statistics gathering. In this schageer,. we will analyze the cast of various 
basic operations in the data base system and then discuss the objective cost function that 


our index selection procedure will endeavour to minimize. 


As we have seen, the processing of a query using indices involves the retrieval of the 
relevant TID lists from. the indices, the manipulation of these lists to obtain an intersection 
or union list, and the accessing of the restricted set of tuples as identified by the resultant 
list. As in previous studies, we assume that the manipulation of the TID lists is done in 
main memory, and is therefore negligible according to our cost criterion of page accesses. 
Similarly, any need to remove “false-drops” from the restricted set of tuples is done in 
primary memory at a negligible cost. Hence, the processing cost of a query using indices can 
be assumed to be made up of two components: the cost of using the relevant indices and 
the cost of accessing the restricted set of tuples. As regards to modifications to the data base 
(update, insert and delete), maintenance of the existing indices in addition to modifications 


of the stored representation of the relation must be made. Since the latter cost is incurred 


Chapter 3 31 Cost Analysis 


regardless of what domains are indexed, we will ignore it from further consideration, and 
concentrate only on the maintenance cost of the indices in determining the costs of data base 


modification. 


1. Tuple Access 


In order to compare the utility of different sets of indices towards the processing of a query, 
we need to have an estimate of the sizes of the sets of tuples that must be accessed in order 
to resolve the query, given the availability of each of the index sets. We have earlier 
defined the average selectivity of a domain as a measure for the resolving power of an 
index on that domain. Using the selectivity of the domains specified in a query which are 
indexed, we can estimate the number of tuples that have to be examined to evaluate all the 
predicates. Since our cost criterion is the number of pages that have to be accessed, we have 
to translate this expected number of tuple accesses to an expected number of page accesses. 
We feel most previous index selection studies have been inaccurate in their choice of cost 
model for the accessing of such a restricted set of tuples. In [Scholnick75] linear relationship 
between the number of tuples to access and the accessing cost is assumed. This is equivalent 
to saying that each tuple specified in the resulting TID list will incur one page access. In 
(Held75b], a piecewise linear relationship is assumed: if the relation is stored as p pages of t 
tuples each, and r tuples are to be accessed, then the number of page accesses is assumed to 
be min(r, p). In a paged memory environment in which tuples are blocked together on 
pages, neither of the above schemes accurately model the tuple accessing process (since the 
restricted set of tuples can be accessed in the order of their TIDs so that tuples from the 
same page will incur only a single page access). A more realistic scheme to estimate the 


accessing cost for r tuples is to assume that they are equally likely to be any r tuples in the 
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segment, and to use the expected number of page accesses for r randomly selected tuptes in 
the relation as an estimate for the tuple accessing cost for a query whose resolution under 
the availability of a particular set of indices is expected to require the retrieval of r tuples. 
This expected number has been considered in a number of previous studies [Rothnie72, 
Schmid75, Yue75). However, the formulations that have been derived are often 
computationally infeasible or inaccurate. Based on a Markov model approximation to the 
underlying process of accessing r randomly selected tuples, Rothnie [Rothnie72] has obtained. 
a lower and upper bound on the expected number of pages that have to be touched. 
Schmid and Bernstein [Schmid75] using a combinatorial analysis, have derived an exact 
formulation that involves a complicated recurrence relation whose computation for moderate 
values of the parameters becomes very costly and inaccurate because of the significant 
round-off errors encountered. The following formulation, due to Yue and Wong (Yue75], is 


by far the most satisfactory. 


Let n = number of tuples in segment 


t = number of tuples per page 
p = number of pages in segment 
f(r) = expected number of page access for r randomly selected tuples, then 
(3. 1) £(0) = 0 
-1)-i 
(3, 2) f(r+1) = : £(i) + : 
-j pt- 


The value of f for an arbitrary value of r can be computed from the recurrence relation 
with relatively little round-off error. However, this computation involves r multiplications 


and r divisions and is therefore quite expensive to carry out. We (in conjunction with 
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Bahram Niamir of the MIT Laboratory for Computer Science) have obtained a closed form 


solution of the above difference equation which can be computed more efficiently. 


n 
(3. 3) f(r) = - (: - 


A detailed derivation of this formulation is included in Appendix 1. The above 
formulation also admits of a simple interpretation. Consider an arbitrary page in the 
segment; the probability that it does not contain any of the r desired tuples is equal fo the 
number of ways of choosing t tuples from n - r tuples, divided by the number of ways of 
choosing t tuples from n tuples. Hence, the expression within the parenthesis gives the 
probability that this page contains one or more of the r desired tuples. Thus, multiplying 
this expression by the total number of pages in the segment gives the number of pages 
expected to contain one or more of the desired tuples, ie, the expected number of page 


accesses. 


For a fixed value of p (say 1000), and for a typical value of t (say 50), the shape of the 
function f(r) is shown in curve #3 of Figure 4. It is instructive to note that for values of r 
close to, but less that p, the value of f(r) is roughly 0.6p, which is substiantially different 
from the value given by a linear cost function. (Curve #1 indicates a linear cost function 


(Scholnick75] while curve #2 indicates a piecewise linear cost function [Held75b].) 
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2. Index Creation 


In order to determine if reorganization is worthwhile, we need to have an estimate for its 
costs and its payoffs. Two major components of the reorganization cost (due to a change in 
indexing policy) are: 


(1) cost of retranslating existing application programs; 
(2) cost of creating the new indices. 


The former can be estimated from the previous translation costs of the individual 
application programs. (In many systems, the data manipulation language is interpreted in 
which case no retranslation cost is incurred as a result of physical reorganization.) The 
latter, in general, depends both on the current size of the relation and on the number of 


distinct key values in each of the domains. we assume that an index is created as follows. 


(1) For each tuple in the relation, a pair consisting of the value of the tuple for the 


indexed domain, and of the tuple identifier is formed. 


(2) These pairs are sorted, with the domain value as the major sort key, and the tuple 
identifier as the minor sort key. (Typically, this will involve an external sort consisting 


of a sorting and a merging phase.) 


(3) A data structure (see Figure 3) that facilitates the accessing of the list of tuple 


identifiers for tuples associated with any value in the domain is constructed from the 
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sorted pairs of domain values and tuple id. 


h 


number of tupies in relation 
= number of pages in segment 
number of words per page 


or z< 
i] 


= number of pages available for internat buffering 
In = number of words in the representation of a-key value in the demain 


Step one involves the scanning of the segment and the formation of the pairs of key value 
and TID. For practical sizes of the data base, these pairs have to be written back to 
secondary memory for temporary storage. This can be: combined with the initial internal 
Sorting phase of step two with a.cost of | 

(3. 4) p + [fa(inei) /w] 

where p is the cost of scanning tuples in the segment and [n{Inel)/w] is the number of 
pages needed for the writing out of the n pairs (each of length In+l) of demain value and 
TID into the initial sorted subfiles. Let 


(3. 5) p’ = fn (nei) /w) 


Then we will assume that at the end of step one, s - |-subfites of length b and one of length 
b’ are formed where 


(3. 6} s = fp'/b} 
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b if mod (p',b) = 0 
(3.7) ot = ( gia tees } 


mod (p', b) otherwise | 


The cost of merging these s subfiles is derived in Appendix 2 and is given by 


However, we note that in the last pass of the merging process, instead of writing out the b » 
(s - 1) + b’ pages for the single sorted file, we will build the VSAM-like tree for the index. 


Hence, the cost of the second phase of the index creation procedure is 


Finally, the cost of the third phase consists of writing out the leaf and node pages of the 
index tree and can be estimated as follows. (We will assume that pages in an index are not 


filled to capacity at creation time, so as to facilitate subsequent modifications.) 


Let U, = initial fraction of utilization in a node page 
U, = initial fraction of utilization in a leaf page 


Vv = number of distinct key values in the indexed domain 


oO 
u 


number of key pointer pairs a node page can contain 


ny 
C] 


initial number of key pointer pairs with which a node page is filled (#u,,2c) 


then the number of leaf pages If is 
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(3. 10) 1f = (v* In + n)/(u, ® W) 
The height h of the index tree is 
(3. 11) h = flog ,1f] 


(where the leaf nodes are at height 0), and the number of node pages is 


(3. 12) (LE/k] + [PLE/KE/KT + [[ME/KY/KT/KT + 00000. 
= 1f/k + LE/k2 + fA + o.oo, 
= lf/(k - 1) 


From the above analysis, we also have a rough estimate for the storage requirement of an 


index on the domain in question, which is 

(3. 13) lf + 1f/(k - 1) 

The above analysis has been motivated by the need to estimate the costs of index creation 
and storage. However, it depends very much on the number of distinct keys in the indexed 


domain, for which we can only have a rough estimate. Consequently, it will be difficult to 


come up with a close estimate of the index creation and storage costs. 


3. Index Accessing and Maintenance 


(The average cost of using an index as well as the total maintenance cost of an index within 
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an interval can be directly measured. The purpose of the analysis below is only for the 


purpose of estimating these parameters for those domains which are not indexed.) 


The use of an index to obtain the list of tuple identifiers with a particular value in the 
indexed domain involves starting from the root of the free, and following a path through 
the node pages which leads to the leaf page containing the desited TID list. Let h be the 
_ height of the index (where the root of the tree is at height 0, and the leaf pages are at 
| height h), then the cost of using the index to obtain an average TID list can be éstimated as 


(3. 14) h + fl£/v] 


Similarly, the cost of rhodif ying a TID list in a leaf of the index es a result of a tuple 


insertion or déletion) when no overflow or underflow is incurred, is 
(3.15) h + 2 [lf/v) 


(The maintenance to an index due to thé update of a tuple in the indexed domain can be 
assumed to be the sum of the maintenaiice due 'to a delete and an insért.) The cost of index’ 
maintenance due to the splitting and merging (or garbage collection) in the index is more 
difficult to parameterize, since it depends on the actual sequence of tuple insertions, 
deletions and modifications. This component has often been completely ignored in previous 
studies. Here, we can add to ‘the above tost an average overhead cbst per modification, a 
parameter which can be obtained by’ monitoring the actual‘ maintenante ofan index. For 
those domains that have not been previously index, the normalized average overhead 


among those indices that have been maintained ‘tah ‘be‘used as a Very rough estimate. 
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4. Total System Cost 


With full knowledge of the upcoming requirements, the total system cost for the next 
interval under a particular indexing policy is computed as follows. For each query type, we 
can determine, using the selectivities of the domains that occur both in the query and in the 
proposed index set, haw many tuples will need to be scanned to resolve this query type, with 
the full use of the indices. Our non-linear cost function translates this into an expected 
number of page accesses. To this is added the expected number of page accesses that are 
involved in accessing the indices themselves. This then gives us the total processing cost for 
this query type, if the proposed indices are used. We then know if, the query processor 
would, given the proposed index set, use them to resolve this query type or would process it 
by means of a sequential scan. In any. event, we thus know the projected cost of processing. 
this query type in the presence of the proposed set of indices. We multiply this cost by the 
expected frequency of this query type, and repeat the process for all the query types. This 
gives the projected total query processing cost. Adding to this. the prajected indexing. costs 
(creation (if applicable), maintenance (due to tuple insertion, deletion and modification) and 
Storage) and the application program retranslation costs (nil for the index set which is 
identical to the one that is maintained in the previous interval) yields the tatal system cost 


for the next interval. 


Let CC, = expected creation cost af index on domain. (if not already exist) 
MC, = expected maintenance cost of index on domain i 
SC; = expectedstorage cost of index on.domain i 
AC, = expected cost af obtaining.a TID, list using an index.on domain.i 
AS, = average selectivity of domain i 
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Q = projected set of queries in upéoming interval 
Fy = occurrence frequency of query q where qe Q 

n = average number tuples in relation 

p = average number of pages in stored representation of relation 
D, = set of domains indexed in the previous interval 
D, # set of domains specified in query q 


4 
® 


q = type of query q (0 if con junctive, | if disjunctive) 


RC(D) = application program retransiation cost, 0 if D = Do 


Ig (D) = lif Dy ¢ D, 6 otherwise 
C, (D) = cost of processing query q with the index set dD 
= (iT g)eminip ld, , Dad ACF Da p,ASive))) + 


T gAttlqbopelgemin(p (By «p AC) f(T ep (r8,))en)) 


The objective of the index selection procedure is then to select the‘index set_D which 


minimizes the following expression: 


(3. 16) Bien (Cy + MCy + SCy) + Bagg Fy # Cq(D) + CCD) 
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CHAPTER. 4 
PARAMETER ACQUISITION 


A fundamental problem in an adaptive system operating in a dynamic and uncertain 
environment is the exploitation of new information in reducing the uncertainty of the 
system. In our context, this involves the utilization of observed data on how access 
requirements and data characteristics change over time in the estimation of exogenous 
(uncontrollable) parameters essential for predicting the performance. of different indexing 
organizations for the planning horizon. In the fallawiag sections; we. will: describe the 
Statistics that are to be collected during transaction processing. We. will then explain our 
choice of forecasting technique and how the. various parameters in the. system are to be 


forecasted. 

1. Statistics Gathering 

Statistics are collected during the procesing of each data base transaction for two purposes: 

(1) as a direct measurement of certain system parameters in the current time interval, 

(2) to be used in the indirect estimation of certain system parameters, parameters that 
cannot be measured directly or whose direct measurement would entail excessive 


overhead. 


The statistics to be gathered for the purpose of index selection fall into four general classes. 
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(1) 


(2) 


(3) 


Index Maintenance Statistics - This has several components. . First of all, there is the 
total maintenance cost. of each active index in the current interval. For domains that 
are not indexed, we need to obtain an estimate of the cost that: might have been 
expended had an.index.been maintained:on each of these-domains. For this purpose, 
we record. the tota] number of tuples that are-deleted. from:and inserted into the 
relation in the current interval, and the number. of updates to-each domain-in the 
tuples. In addition, we will break down. the maintenance cost of each active index into 
two parts: the cost of basic maintenance to a leaf in the:index tree, and the more 
difficult-to-parameterize costs of node splitting and merging necessary for maintaining 
the index as a balanced tree. This will allow us to calulate the normalized node 
splitting and merging overhead per insertion or deletion (an update can be counted as 
a delete and an insert) in the active indices, which will be used in estimating the cost of 


maintaining an index on a domain which is not indexed in the current interval. 


Query Type Statistics - The type of a query is determined by the set of domains it 
utilizes and by whether it is a conjunction or disjunction: of equality predicates. We 
record the occurrence af each query (this can be encoded-as a. bit pattern) and then 


summarize the occurrence frequencies of each query typefrom time to time. 


Domain Selectivity Statistics - For..each domain, we maintain its average selectivity 
over all uses of the domain in equality conditions in the current interval. This is 
accomplished by recording the number of times the domain occurs in equality 
conditions and the sum of the selectivities of the domain in each of these predicates. 
If an index for the'domain is used to resolve the:partieular equality-condition, then the 


precise selectivity of the domain for this query can be calculated as the fraction of 
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(a) 


tuples in the relation with the domain value in question. If the equatity condition is 
resolved through a sequential scan, the selectivity of the involved domain has to be 
calculated in a reduced tuple space and then extrapolated to the: entire tuple space. 
This is necessary for two reasons: in the’ first place, the scan may be of a reduced set 
of tuples identified through the use of an index (or indices); secondly, we assume that 
the query resolver is efficient. in that it will aveid the wanecessary-checking of tuples 
against equality predicates (i-c., avoiding testing subsequent predicates in a con junction 
once one has evaluated to false, or in a disfunction once dre iis evaluated to: true). 
The estimation of selectivity can be done'as follows: - 


Suppose the equality condition appears in a conjunction of cqnditions of the form 


Cy, alg acrrrrrs AC, 


where each of the C; is an equality condition invelving domain Dj. (We assume that 
the ordering of the equality conditions above reflects: the order of conditions against 
which a tuple is checked.) Let No be the total number of tuples scanned, and let Nj, 
No, eee N, be the number of tuples that satisfy Cy, C,. n'Co, vere, Cc, A Co Av A C, 
respectively. (Note that these numbers are readily available). The selectivity of 


domain D, for this query is then approximated as 


(c) Suppose the equality condition appears in a disjunction of conditions of the form 
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(4) 


C, VCy Ventre Vc, 


where each of the C; is an equality condition involving domain D;. Let No be the 
total number of tuples scanned'and let N), No, ~~“, N,, be'the number of tuples that 
satisfy Cy, “Cy A Co, ~~, #C) A Cy A ~~ AC, respectively. (Again, these numbers 
are readily available). The selectivity of domain Dj for this query is then 


approximated as. 


». (4s 2) 5; 2 Ny/ (No =. Zc Ny) 


Index Accessing Statistics - For each active index, we record the number of times it is 
used for resolving equality conditions and the total cost experided: an such USES. . This. 
allows us to obtain the average cost of using the index. As for a domain that is not 
indexed, we can estimate the number of distinct values in each“as the reciprocal of its 
observed selectivity and use the procedure ‘described in the previous chapter for 


estimating its average accessing cost. 


The foregoing statistics comprise our model of the usage pattern of the data base. The 


frequency count of the query types, together with the index maintenance statistics constitute 


the record of transactions with the data base. By recording: the types of the queries that 


actually occur, we detect any correlations (positive or negative) that may exist between the 


occurrences of different domains in a query (it may happen: that some combinations of 


domains are frequently used together, while others rarely are). Thus we avoid making the 


strong (and often inaccurate) assumption that the sinvultiveous-eccurrences of domains ina 


query are mutually independent events. (Previous studies have made this assumption, and 
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so have recorded access history merely as the frequency of each domain's occurrence in 
queries.) We observe that our measure of domain selectivity serves as a succinct yet precise 
indication of how a domain is actually used in queries. By averaging the selectivities of the 
actual occurrences of a domain, we take into consideration both skewness in the distribution 
of domain values over the tuples as well as non-uniform use of domain values in queries. 
This measurement of selectivity is more accurate than its conventional estimate as the 
reciprocal of the number of distinct values in the domain. Finally, we:note that all.of the 
foregoing statistics can be collected and maintained with very little overhead, either in 
execution time or in storage requirements. All of the required information can be. easily 


obtained during query or transaction processing, and requires little space for its recording. 


2. A 


As we have said, at each reorganization point, we forecast a number of characteristics of the 
system for the interval up to the next reorganizational point. Specifically, we predict the 
following: 

(1) the average size of the relation (number of tuples and number of pages); 

(2) the average selectivity of each domain; 


(3) the expected cost of maintaining an index for each domain; 


(4) the expected storage requirement of an index for each domain; 
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(5S) the expected cost of each use of an index in obtaining the TIDs of those tuples that 


satisfy an equality condition involving the indexed domain; 
(6) the number of occurrences of each query type. 


We could do these projections solely on the basis of statistics collected during the latest time 
period, or we could combine together the statistics collected over all previous periods. 
However, neither alone would be satisfactory for the purpose of a stable and yet responsive 
adaptive system. In the former case, the system would be overly vulnerable to chance 
fluctuations, whereas in the latter case, it would be too insensitive to real changes. A more 
satisfactory approach would be to take into consideration the pattern of change in each of 
these parameters in earlier time intervals in arriving at predictions for their values in the 
upcoming interval. A broad spectrum of techniques is available for the analysis and 
forecasting of time series. However, because of the potentially large number of parameters 
in our cost model, we have to restrict ourselves to those forecasting techniques that are 
efficient in terms of computation and storage requirement. Specifically, we consider here 
the technique of exponential smoothing for our forecasting procedure because of its 
Simplicity of computation, its minimal storage requirement, its adjustibility for 
responsiveness and its generalizability to account for trends. In the following discussion, we 
will refer to the t th observation of a time series (i.e, values of a parameter over successive 


periods of time) as x(t) and the next forecast based on observations up to x(t) as X(t). 


3. Exponential Smoothing 


Intuitively, a weighted moving average strikes a reasonable balance between the two 
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extremes for parameter prediction mentioned earlier. Forecasts derived. by weighing past 
observations exponentially (or geometrically) have been used with some success in operations 
research and economics [Brown59, Muth60, Winters60, Brown62]. The forecast is based on 
two sources of evidence, the most recent observation and the forecast made one period 


before. The exponential smoothing procedure, in its simplest form, is carried out as follows: 
(4.3) ~ FC) = x) 
(4. 4) Rk) = @ X(k) + (1 - @) Rk - 1) 


where e is called a smoothing constant and takes on. values between O andl. A closed form 
expression. for %(k) is 


k-1 k-2 i 
(4. 5) R(k) = x(1) (1 - @) +@ Pots Xk - i) GQ - @) 


In essence, the new forecast is calculated as a weighted average of all previous observations 
with the weight decreasing geometrically over successively earlier observations. The 
compactness of the scheme lies in the fact that only two parameters need to be maintained 
for each time series: the current observation and the previous estimate. Note that equation 


(4.4) can be rewritten as follows: 
(4. 6) Rk) = Rk - i) + a@ (x(k) - Rk - 1)) 


We see that the new forecast is equal to the sum of the two terms: the old estimate and a 


correction term that is proportional to the previous forecasting error (difference between 
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forecast and actual observation). The rate of response to recent changes can be adjusted 
simply by changing the smoothing constant: the larger the smoothing constant, the more 
sensitive is the forecast to recent changes and chance fluctuations. Since the weights given 
to earlier observations sum up to one, no systematic bias is eoadced (i.e. the expected 
value of the forecast is equal to the expected value of the random variable). Hence, this 
procedure is appropriate only for the forecasting of the expected values of stochastic 
variables whose sums do not change between successive periods [Denning71]. If there is a 
long term upward or downward trend in the series, the forecast will always lag behind or 
lead the actual observation. Since we expect to observe trends in various activities in the 


data base, it is appropriate to choose a forecasting technique that can accomodate trends. 


4. Adaptive Forecasting 


This is a variant of the simple exponential smoothing technique that takes trend into 


consideration. Its form is (Theil64) 
(4. 7) R(t) = R(t) + e(t) 


(4. 8) R(t) 


@ x(t) + (1 - @) Q(t - 1) + e(t - 1)) 


(4. 9) e(t) = 8 (R(t) - R(t - 1)) + 1 - B) ett - 1) 


where X(t) and e(t) are the trend and the trend ‘change at time t respectively. (Either an 
additive or a multiplicative trend can be incorporated: ‘the latter through logarithmic 


transformation of the original series.) “To carry out a forecast, we ‘need only the current 
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observation, the previously computed values for the trend, and the trend change, and the 


computation is still very simple. 


The appropriate choices for the smoothing constants « and &, however, is a non-trivial 
problem. It is possible to take a completely empirical approach [Winters60). By maintaining 
the previous values for the time series, it is possible to campare the forecasts made using 
different sets of the parameters a and 8 (One reasonable criterion for comparison may be 
the standard deviation of forecasting error). Winters [Winters60] has suggested the method. 
of steepest descent [Beckenbach56] for finding the best parameters for. a. single series. . This. 
method, however, consumes sufficient storage space and computing time to make its 
application to the large number of series in our system feasible. On the other hand, we 
have no reason to believe that a set of parameters that work best for a particular series will 
work equally well for other series, so that it might not be tee practical to choose the 
optimum weights for one series and use the same weights for. all other. series. Theil and 
Wage [Theil64] have formulated an explicit stochastic model as.a basis for the above 
forecasting method (equations (4.7) through (4.10)). The time series is postulated to be 


generated by 
(4, 10) X(t) = &(t) + w(t) 
(4. 11) g(t) = e(t - 1) + g(t) 


where Xt) is the mean of x(t) and 9(t) is the trend change from period t - | to. period t. (We 
can interpret X(t) and e(t) of equations (4.8) and (4.9) as estimators. of &t) and eft) 
respectively.) The trend change is postulated to be generated by — 
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(4. 12) a(t) = 9(t - 1) + v(t) 7 


where u(t) and x(t) are time series with zero mean, constant variance (2(n) and o2(y) 


respectively) and zero covariance of all kinds. 


For this underlying model, Theil and Wage [Theil60] have found the optimal weights a 
and 8 to be used. Let 


(4. 13) g? = 67) /e* W) 

(4. 14) h? = -g2/8 + g(1 + 8°16) 1/272 
Then the optimal weights for « and £ are 

(4, 15) @ = 2h/(1 + h) 

(4. 16) B=h 


The mean square error of the forecasts is dependent on the choices for a and 8 which in 
turn are dependent on the estimate for the variance ratio g? (ratio between the estimates f or 
e*(v) and ¢4(y)) A sensitivity analysis of the consequences of error in estimating the 
variances ratio (g?) in [Theil60] has shown that a 50% error results in less that 15% increase 
in the mean square prediction error. We can therefore start with a rough estimate of ge, 


determine the appropriate values for a and 8 (or equivalently, we can start with an arbitrary 


choice for @ and §), and adapt these coefficients to updated estimates of the variance ratio. 
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Our application of the above technique in the forecasting of a time series in our system can 


be summarized as follows. 


At initialization, let 


(4.17) X(0) = x (0) 

(4. 18) e(0) = 0 

Att =1, 

(4. 19) X(1) = @ x(1) + (1 - a) X(0) 


(4. 20) e(1) = 8 (K() - X(0)) 

Att = 2, 

(4, 21) ¥(2) =a x(2) + (1 - a) (R(1) + e(1)) 
(4. 22) e(2) = 6 (X(2) - X(1)) + (1 - @) e(1) 
(4, 23) o7(u,2) = (RCL) + e(1) - x(2))? 


(4, 24) e*(¥, 2) = (e(2) - e(1))4 
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(#%(u, t) and 7, t) are estimates of the variance ¢%(y) and #4) at time t) 
For t > 2, 

(4.25) R(t) = a@ x(t) + Gi - @) (R(t - 1) + oe - 1)) 
(4.26) e(t) = @ (R(t) - R(t - 1) + (1 - —) e(t - 1)) 


(4.27) «7G,t) = C(t - 3) e2G@,t - 1) + 
(x(t) - 8 - D -et - DY7E- DD 


(4.28) 2, t) = ((t - 3) 67@t - 1) + (e(t) - eft - 120247 - 2) 


We begin by using arbitrary vatues for « and 8: As new estimate for the variance ratio g? 
becomes available (from the ratio of #(¥, t) to (uy, t)) , we cari adapt the values for a and 8. 
(Note that the amount of information that has to be passed on-from one interval to the next 
is still quite small, and the computation needed to choose the appropriate weights is 


minimal.) 


5. Parameter Forecasting 


Using the foregoing techniques, we can summarize our forecasting procedures for upcoming 


parameters as follows: 


(1) average size of the relation - we can use the current size of the relation as our current 
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(2) 


(3) 


(4) 


observation, forecast the size of the relation at the end of the upcoming interval, and 
use the average of the two as the average size of the relation over the upcoming 


interval. 


maintenance cost of each domain index - if the domain is indexed in the previous 
interval, then its actual maintenance cost can be used as the latest observation; 


otherwise we can use the estimated cost.as described earlier as the latest evidence. 


number of occurrences of each query type - if an observed query type has no previous 
forecast, then we will use the observed frequency.as the next, forecast and treat this as 
a new series to be forecasted. 


average selectivity of each domain, storage requirement and average accessing cost of 
each domain index - we note that our estimates for the current values of these 
parameters in the case of non-indexed domains.are rather crude, hence we will 
reinitialize the forecasting procedure for each “newly” indexed domain, ie., if. a domain 
is indexed in the most.recent interval but not in. the one before, then we will use the 


most recent observation as the sole evidence in the forecasting of these parameters. — 
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CHAPTER 5 
INDEX SELECTION 


A straightforward approach to the index selection problem would be to evaluate the 
projected total system cost for each possible index set (using equation (3.16), and then select 
that set of domains which gives the smallest cost. With m domains in the relation, there are 
_ 2™ possible choices of index sets. For small m (say less than 10), this enumerative approach 
may be feasible for finding the optimal combination of domains to be indexed. However, 
_ because of the exponential rate of increase of the number of possible index sets with the 
number of domains, the search space becomes prohibitively large very rapidly. (With 30 
domains, there are more than 10° index sets to be considered. The cost of exhaustively 
exploring the search space may no longer commensurate with the profits that can be 
gained.) Yet, it is not uncommon to find single-relation data bases with tens of domains. 
Therefore, it is appropriate to look for ways whereby the search space of potential index 
sets can be systematically reduced. One possible approach is to look for properties of the 
cost function that will allow it to be minimized without exhaustive enumeration, such as 
through a depth-first search, as exemplified in Schkolnick’s index selection study 
(Schkolnick75]. However, these properties depend upon unrealistic assumptions that domain 
occurrences are uncorrelated and that the tuple acess‘ cost function is linear; and even so, 
the associated upper bound of gm? log ™ index sets to be tested is not enough of a 
reduction to enable the inexpensive selection of the optimal index set for a relation with a 


moderate number of domains. 


Wheh we remove the above two assumptions, the computation needed to evaluate the utility 


of a proposed index set becomes dependent on the ‘humber of distinct’ query types 
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forecasted. (All told, there are (2™ - 1) possible conjunctive query types (which specify 1 or 
more domains) and (2™ - m - 1).disjunctive query types(which specify 2 or more domains 
for a total of (2%! - m - 2) possible query types.) Except in cases when only a few of the 

large number of potential query types actually occur, the evaluation of the cost-effectiveness 
of a particular potential index set is quite expensive. Hence, we have a strong incentive for 
systematically reducing the search space for the optimal index set. Yet, because of our lack 
of simplifying assumptions, the hope of finding an alggrithmic way to explore a. reduced 
search space of practical size and. still finding the aptimum. is. dim. Therefore, it is 
appropriate to draw on the experience of artificial, intelligence researchers working in areas 
where formal mathematical structures are computationally impractical, and use. heuristic 
methods [Feigenbaum63, Meier69] that significantly prune down. the search. space. and that 
work towards obtaining a near-optimal solution. 


1. Index Selection Heuristics 


In this section, we examine the structure of the index selection problem and describe a 
number of ways in which the index selection cost can be reduced. 


(1) Not all queries can use indices profitably. The expected set of tuples that satisfy a 
query may be so large (i.e. the qualified tuples are-tikely to reside on close to p pages) 
that no set of indices could possibly be useful in its processing. Since the cast of 
computing the utility of a proposed set of indices is dependent on the total number of 
queries under consideration, those queries that cannot profitably make use of indices 
should be removed from the projected query set whose. processing cost is to be 
minimized. This can be done by computing the processing cost of each query given 
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(2) 


the availability of indices on all domains that are used in the query. If this is more 
expensive than a sequential scan, then the query should be removed from the projected 


set of queries. 


Some domains can be eliminated from the initial candidate set by virtue of their low 
occurrence frequencies in queries. This effectively reduces m, the initial number of 
domains in the candidate set. Using the forecasted frequency of each query type, and 
the projected selectivity of each domain in the relation, we can compute an upper 
bound on the number of page accesses that the use of an index on a particular domain 
can save in the processing of the forecasted ‘set of queries.. If this upper bound is less 
than the projected cost of maintaining an index on the domain, then this domain can 
safely be excluded from the initial candidate set, i¢., the domairt is so unselective or is 
used in retrievals so infrequently relative to its being updated, that it cannot possibly 


be profitable to index it. 


The upper bound on the utility-of an index‘for an arbitrary domain i is computed as 
follows. Let q be a conjunctive query type that involves domain i, and let Sq be the 
joint selectivity of all the domains of q. Then the tuples that satisfy q are expected to 
reside on f(Sgan) pages, where nis the total number of tuples in the relation and f is 
our non-linear function for expected page accesses. So an upper bound on the benefit 
that an index on i could possibly bring to the evaluation of q would be to reduce the 
number of pages to be accessed from p to fS_un). (A similar formula holds as well for 
the maximal reduction where q is a disjunctive query, but with Sq there representing 
the joint disjunctive selectivity of the domains used in-q.) In the case of a conjunctive 


query, an additional upper canbe computed which in some cases may be tighter than 
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(3) 


the one just mentioned. Note that an index on domain i with. selectivity S; reduces the 
number of tuples that have to be examined to resalve.a conjunctive query involving 
domain i by a factor of S;. However, because of the convexity of our. tuple access cost 
function, a reduction by S; in the number of tuples to access leads to less than a 
reduction by S; in page accesses. Hence the maximal: incremental saving (in terms of 
page accesses) cannot exceed pa{l - S,). Thus the upper bound. on the utility of an 


index for i is: 


(5. 1) Fue Q,Fq* ¢ (1-T,) amin (ps (1-55), pa (i-£ ( (Dy, B,°3 an)) + 
Tgape(i-£( (1-H, D, G-S5) )en)) 


where Qj; =. set af forecasted query types that use domain i. 
prajected. number of occurrences of q. 


set of domains referenced in q 


“1 
u 


a 
a 


q 7 0 if q is conjunctive and 1 if q is disjunctive 
total number of tuples in.the relation 


rh s 
a 


non-linear tuple access cost function for the relation 


Some domains could be known te be included in the optimal index. set by virtue of 
their high occurrence frequencies in queries. For each: domain, we can compute a. lower 
bound on the savings in query processing its indexing can: bring. If the latter is less 
than the expected maintenance cost, then the domain. must be included in the. optimal 
index set. In cases where a domain is used together with: others. im a query, it is very 
difficult to assess the lower bound..on: the utility of the: indes.. Therefore, we will 
compute.the lower bound for a domain. based.oaly. on those-queries in which the 
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domain occur alone. 


(4) A near optimum choice of the index set can’ be made incrementally. This heuristic 


permits analysis of the problem as a stepwise minimization, each time adding to the 


index set that domain Which will bring the best improvement to the cost function. 


There have been two previous suggestions regarding the incremental selection of 
domains to be indexed. Farley and Schuster [Farley?5) suggest that the incremental 


selection process can be terminated once no single domain in the non-indexed set can 


be chosen that wifl yield inctemental cdst/benefits. This ts insufficient for our choice 


of query and tuple access models; there aretwo reasoris why it may be necessary to 
consider the incremental savings brought by adding’ two-or more indices together to 
the index set candidate. “First, it may tiappen that’ for a query involving a conjunction 
of conditions, the selectivity of any one domain may not’ be sufficient to reduce the 
number of pages to be accessed to significantly less than the total number of pages in 
the relation, whereas the joint selectivity of two.or more domains might. (Recall that 
the reduction must be significant in order to cover the index accessing cost.) Secondly, 
a disjunction of conditions can be resolved via indices onfy if aft of the domains 
involved in the disjunction are indexed. An altérnative strategy has been suggested by 
Held (Held75b), who, at any stage of the incremental index selection procedure, 


considers the incremental savings of each of the possitle subset of domains in the 


candidate set with fess than or equal to ‘some fixed number of domains in it. ‘This, of 


course, may be very inefficient. We have taken an intermediate approach. We 
consider the adjoining of multiple domains to the index set only if no single domain 


that will yield positive incremental savings can befound. 
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($) Only a small subset of all possible candidate domains need be considered in 


(6) 


determining the next domain or set of domains to be ad joined to the index set at each 
stage. We can rank the domains with respect to the maximal savings each can bring 
and then consider only the top ranking M domains, and cambinations of them, for 
detailed incremental savings calculation. (Alternatively, we can take the maintenance 
cost of each into consideration and divide the maximal savings of each index by its 
maintenance cost before doing the ranking.) Furthermore, a bound M’ (M’sM) can be 


imposed on the number of. domains that will be considered together. 


An upper bound can be put on the total number of cast evaluations (i.e, the total 
number of index sets considered) to be performed.in the. entire selection procedure. — 
Also, an upper bound can be put on the maximum size of. an index. set that will be 
considered. The incremental selection. procedure. will be terminated when. either of 
these bounds is exceeded. 


2. Index Selection Procedure 


To illustrate the above heuristics, we present the details of our.index selection procedure. 


Our pracedure can be divided into three phases. 


Phase | (Initialization) - During this phase, a tentative index set is chosen to include all 


those clearly profitable domains, and a ranking of the.domains that might be profitable to 


ad join to the tentative index set is computed. This involves.the following steps: 


(a) 


Remove from the projected set-af queries all those. that cannot profitably make use of 
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(b) 


(c) 


indices. 


For each domain, compute a lower and upper bound on the savings an index on the 


domain.can bring. 


Partition the set of domains D in the relation into three’ disjoint subsets: D, - the 
tentatively chosen index set, De - the candidate set, and Da - the non-profitable set. 
Initialize D, with those domains whose maintenance costs are less than the 
corresponding minimal savings they can bring, D,, with those domains whose 
maintenance cost exceeds the corresponding maximal savings they can bring, and D. 


with D - D, - Dy. 


(d) Rank the domains in D, with respect to theit estimated utility. 


Phase 2 (Incremental Selection) - The tentative index set is enlarged by the adjoinment of 


domain(s) to it incrementally. 


(a) Consider in turn the incremental savings gained by indexing each of the M top 


ranking domains in the candidate set (ie, for each of these domains d, compute the 
cost associated with D, + d, and compare it to the cost associated with D,). Ad join to 
D, that one which will give the best improvement to the cost function. If one cannot 
be found, then consider larger-sized combinations (up to M’) of these M domains. 
Consider combinations of the next larger size only if it is not profitable to adjoin any 


of the combinations of the current size. 
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(b) Remove the domain(s) from D, as they are ad joined to D,. Resume considering 


individual domains for further adjoinment after an ad joinment to D,. 


(c) Terminate the incremental selection if no subset (of size less than or equal to M’) of the 
M top ranking domains in the candidate set can be chosen such that its adjoinment to 
the index set will improve the index set's cost function; .or if the upper bound on the 


total number of cost evaluations is reached. 


Phase 3 (Bump-shift [Kuehn63]) - Domains that have been adjeined to the tentative. index 
set early in the incremental selection phase may: urn owt to be uneconomical as the result of 
later addition of other domains to the set, and thus should be removed:from the index set. 
Since the probability of the need for the simultaneous removal of more than one domain is 
quite small, we will only consider the removal of individual domains. (The: necessity to 
remove two domains from D, implies that some of those queries whose processing costs are 
Significantly improved by the initial ad joinment of these domains. to D,, become: less 
dependent on them as other indices become available. The fact that it is not profitable to 
remove one of them alone implies that there are some queries which depend on both of 
them, and whose processing costs are improved, in. the presence of bash indices; by more 
than the maintenance cost of either one, but less than the maintenance cost af both. Such a 


combination of circumstances is rare enough for us to ignore it) 


(a) For each domain d tentatively assigned to the index. set, subtract the total cost for D, 
from that for D, -d. Remove from D, that d for which the above difference is largest. 


(b) Repeat the process until no domains remain in D, whose removal would improve its 
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cost function. 


In order to assure that we have a real local optimum, we may go back to the 
incremental setection phase after some domains have been removed from the tentative 
index set. To guarantee that the process terminates, we would put a domain d into D,, 


if it is removed from D, by the bump-shift phase. 
3. Performance of Index Selection Heuristics 


We have discussed a number of ways in which the index selection problem may be 
simplified. The initialization phase of our heuristic selection procedure leads to a reduction 
in the search space for the optimal index set and a reduction in the total number of queries 
that have to be considered under any proposed itidexing policy, without jeopardizing the 
possibility of finding the real optimum index set. On the other hand, when we make use of 
the heuristics of stepwise minimization and of considering only thé top ranking domains for 
incremental selection, we have opted for a good solution at reasonable cost rather than the 
optimal solution at any cost. There are several reasons why the stepwise minization 


procedure should be good. 


(a) It resembles the methods that might be employed by an intelligent human being in 
solving the index selection problem. For any tentatively chosen index set, we know for 
sure that the cost of maintaining these indices is less than the savings that they bring. 
Furthermore, the total system cost is monotonically decreasing as successive domains aré 
added to (during incremental selection) and removed from (during bump-shift) ‘the 


tentative index set. 
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(b) 


(c) 


It has been successfully applied in problem areas of a similar nature. In [Kuehn63}, 
stepwise minimization was applied to the problem of choosing the sites for warehouses 
from a number of potential sites which minimize a particular cost function. The 
problem is in many respects similar to the index selection, especially in the fact that for 
each potential warehouse site, there is the possibility of having or not having a 
warehouse at that site, just as for each domain, we have the passibility of having or 


not having an index on that domain. 


It actually finds the optimal index set under certain Giesmaiances It is provably 
optimal if only single domain queries and/or disjunctive queries are present in the 
projected query set. In such a case, it is impossible for the heuristic algorithm to 
choose an index set that is a subset of the optimum, since it considers ad joining 
combinations when necessary; also, it is impossible for the heuristic algorithm to 
include in its choice a domain that is not in the real optimum index set. (The fact that 
a domain has been ad joined to the heuristically chosen index set means that there is a 
set of queries which depend on the availability of the index in question in order to be 
resolved using indices, and that the savings from processing these queries using indices 


more than pay for the maintenance cost of that index.) 


In the presence of both conjunctive and disjunctive queries, it is possible that the 


heuristically chosen index set can depart significantly from the optimal index set. However, 


we can argue that the probability of this occurring is quite small, and even if it this does 


occur, the total system cost under the heuristically chasen set of indices.may not be too 


different from that under the optimal index set. _ 
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Let Dopt and Dpeyr be the optimum index set and the set chosen by the heuristic index 
selection procedure respectively; then we have the following circumstances in which Dy our 


will be non-optimal. 


(a) Dopt strictly includes Dy ..,, - This is highly unlikely, since we do consider the 


ad joinment of multiple domains (up to a certain bound) to the tentatively chosen index 


set if no simpler ad joinment is profitable. 


(b) Dheur Strictly includes D,,, - Because of our bump-shift procedure, we know that 


pt 


De ae must include two or more domains that are not in D and as discussed in the 


opt’ 
previous section (on the bump-shift procedure), this is very unlikely. 


(c) There are domains in D,,,, which are not in D,,,,, and vice versa - This is probably 


P 
going to be the most common. The fact that domains which are in Dopt - Dpeur 2re 
not ad joined to D;,,,, implies either that they need to be simultaneously indexed to be 
useful and their total number exceeds the bound on the number of domains that the 
heuristic index selection procedure will consider for simultaneous ad joinment, a 
possibility which is quite remote; or that indices on them are no longer useful in the 


presence of domains in Dyey, - in which case the total system cost for Dy eur 


Dopp 


may not be too far away from that for Door 
We have performed a limited amount of experimentation with the above heuristic 
algorithm, applying it to a number of access histories, and comparing its results to those 
obtained by an exhaustive consideration of all possible index sets. In the cases that we have 


tested, the heuristic algorithm has almost always found the optimal index set at a small 
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fraction of the cost of the exhaustive procedure. Most of the increments to the tentative 
index set consist of single domains, so that the total number of index sets considered only 
increased only linearly, instead of exponentially, with the total number of domains in the 
relation. Moreover, the bump-shift phase seldom yielded an improved choice over that 
given by the incremental selection, which: in many cases was already identical to the choice 
given by an exhaustive search and therefore optimal. 


4. On Further Reducing the Cost of Index Selection 


The main thrust of the heuristic index selection algorithm described in the previous section 
was towards reducing the search space far potential index sets by making the selection 
procedure an incremental one. However, in addition to the need for cutting down the index 
set search space, there is also the need to minimize the cost of assessing the cost/benefits of 
each individual index set. By making forecasts of query type occurrence frequencies based 
on past observation, we have thus far avoided the strong assumption that individual 
domain occurrence probabilities in a query are independent. In consequence, however, our 
scheme requires that in considering each possible increment to the index set, we evaluate the 
costs of processing each of the projected query types that inyolves any of the domains in the 
increment. The number of passible query types is an exponential function of the number of 
domains in the relation; so the number of query types. that actually occur is also likely to 
increase quite rapidly with the number of domains. There may be as many as 2™ - onrk 
conjunctive query types that will require individual computation (for new processing cost), 
where k is the size of the increment under consideration; these are those queries that use at 
least one of the domains in the proposed increment. One possible simplification is to graup 
queries together and to characterize the group in. terms of a small number of statistical 
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properties. Instead of finding the savings in the processing of each of the queries that are 
affected by a proposed increment, we can compute the savings for the group as a whole, 
which can be done more efficiently. In the fallowing sections, we will examine one query 


. grouping scheme that has been suggested previously and suggest extensions to it. 


In [Schkolnick75] (who considers only conjunctive queries), all queries are put into a single 
group which is described by the query occurrence probabilities of each domain, (i.e, the 
fraction of queries in which the domain is used). Furthermore, these probabilities are 
assumed to be independent. For example, for a relation with three domains a, b and c, each 
with occurrence probability P,, P,,, P. respectively, the: probability of having a query that 
involves just the domains a-and b is assumed to be 


(S. 2) Pa * Ph ® (1 Py) 


since P, and P) are the occurrence probabilities of domains a.and b, and | - P, is the 
probability of domain c’s non-occurrence. For-a proposed index set D, the total query 


processing cost.can then be computed as follows. 


Let Ne = total number of tuples 
Ng = total number of queries 
Q = set of all possible queries 
AS; = average selectivity of domain i 
AC; = average access cost for index on domain i 
P,, = occurrence probability of query q, qeQ 
Dy = domains specified in query q, q€eQ 
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Fy = cost of accessing the set of tuples to resolve q using index set D 


cost of accessing (Th ¢p nD AS;) # N, tuples 
Cy = processing cost of query:q with index set D: 
i ed4nd ACi) + Fy 


then the total query processing cost is 


(S. 3) No * Zucq Pg * Sq 


With m domains in the relation, there are 2™ possible queries. However, in evaluating the 
utility of an index set of size s, only 2° distingaithable sub-groups of queries need to be 
considered. (A distinguishable sub-group of queries consists of all those queries with the 
same expected processing cost under a given set. of indices. Given an index set D, two 
queries fall into the same sub-group if they use the same set of domains in D, since their 
processing will involve the use of the same set of intices, resulting: in:the accessing of sets of. 
tuples of the same expected size) Consider the above $-demain relation and the index set 
which includes only domain a, then the possible queries in the group can be divided into 
two sub-groups, those that specify domain a and those don’t. In general, it is necessary to 
evaluate the processing cost of each of these distinguishable sub-groups individually before 
an expected processing cost for an average query can be computed. However, by assuming 
that the tuple access cost function is linear, a further simplication ‘results in the following 


total query processing cost 


(5. 4) Ng td (Bi ep Py) & AC, + Then (1 - Py + Py ® AS,)) 
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The above formula admits of the following simple interpretation: for an average query, 
with probability P,; domain { in the index set'D is specified, iri which case the fraction of 
tuples that have to be examined is-reduced by AS}; and with probability 1 - P; domain i is 
not specified in which case an index on domain-i does not lead to’ any reduction in the 


number of tuples to be examined. 


We thus see that Schkolnick’s scheme leads to a very simple computation for the evaluation 
of the utility of an index set. However, the simplifying assumptions that lead to this 


computational simplicity are not altogether realistic. © 


5. Query Clustering 


We feel that the idea of grouping queries is fundamentally sound, since it significantly 
reduces the number of query types that have to be considered at each step of the 
incremental index selection procedure. On the other hand, grouping can lead to loss in 
correlation information. For example, again consider the above $-domain relation: it may 
happen that domains a and b never appear together in queries, whereas the independence 
assumption will lead us to assume that a query that specifies only domain a and domain b 
does occur with probability P, « P, « (1 - P.). In order to preserve the correlation 
information, we should only group similar queries together. Hence, the division of the 
queries into more than one group may be necessary. Since some correlation information is 
inevitably lost when queries are grouped together and it often happens that some queries 
occur quite frequently while others only rarely, we may want to consider'the most frequently 
occurring queries individually, in the process of incremental indexing utility calculation, 


while grouping the less f requent ones into one of more groups. 
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To incorporate the above scheme, the evaluation of the-utility of any proposed increment to 
the tentative index set can be modified.as fallows. The incremental savings afforded by the 
increment ta each of the frequent (nen-grouped) queries, is. computed.as before. As for each 
of the query groups, we compute the improvement. to.each. distinguishable sub-group of 
queries that is affected by the increment. The improvement to.the group is then computed. 
as a product of the total number of query occurences in the group and the average 
improvement to a query in. the group. The. latter is. obtained. from the sum of the 
improvements to each of the distinguishable subgroups, weighed by their individual 
occurrence probability with respect to the group. 


The clustering scheme we suggest for the less frequent queries, is-of the “nearest-centroid" 
type [Belford75]. (This involves the definition of a metric or a measure for the distance 
between queries and groups of queries. The centroid of a group may be looked upon. as an 
average (or representative). query in the group.) Since the cost evaluation process at each 
Step of the incremental index selection procedure is dependent on the number of query 
groups we have, we may a priari determine the number of graups (say G) into which the 
less frequent queries are to be divided. A possible clustering strategy is as follows. We rank 
the less frequent queries in terms of. their occurrence frequencies, and start off with groups 
that are singletons of the G top ranking queries. The remaining, queries are considered 
sequentially; each is added to the group with the nearest centroid, after which the centroid 
for the affected group is recomputed. 


For each query group, we maintain its centroid and the total number of query. occurrences in 
the group. We represent a query by means. of a binary. vector which indicates the domains 
that are used in the query and the centroid of a group of queries by. means.of a vector that 
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indicates the occurrence probability of the individual domains with respect to the group. 


vector representation of a query group g 


g 
q * vector representation of a query q 


V 
V 
F 


8 total number of query occurrences:in g 
Fy 


total number of occurrences of q - 

The distance between q and g can be computed as 
| (S. 5) VWVq - Vgll = By Woy - Vogt 

When q is added to g, the centroid of the group is recomputed as 


(S. 6) V. « (F. « V. + F * FL)/ (Fy + F 


g © Fy * My t Fe ra 


and the total number of query occurrences in the group is updated as 
(S. 7) FL « F 


In order to evaluate the utility of a proposed index set with respect to a given group of 
queries, we need to have a scheme for the assignment of occurrence probability to each 
possible query in the group. One possibility is to use the independence assumption 
discussed previously. However, this resutts in:the assignment of a non-zero probability to 
the query that specifies none of the domains, whichis inadequate since we never include the 


query that specifies no domain in our grouping scheme. Therefore, we need to have a 
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scheme for the normalization of probability assignments. In addition, we might want to 
take into consideration the complexity (number of domains specified) of the component 
queries of the group. For example, if all of the component.queries in the group involve say 
two domains, then we should discount single-domain or more-than-two-domain queries in 
the probability assignments. In view of the abeve two considerations, we can keep track of 
the number of query occurrences for each complexity in the precess of adding queries to a 
group, and use the following normalization scheme. 


Let Ng total number of query occurrences in the group 


Lay 
h 


total number of domains with non-zero occurrence probability 
P, = occurrence probability of the i” domain 
NC, = number of query occurrences with complexity i 


The conditional occurrence probability of a query q, which uses domains in Dy given that 


the query is of complexity C,, can be computed as the product of the occurrence 


q 


probabilities of domains in D,, normalized by the sum of products of probabilities of ali 


¢; 
non-zero-occurrence-probability domains in the group, taken Cy ata time. The above 
normalization factor for queries of complexity i can be shown ta be the coefficient of x! in 


the following expansion [Liu68} 
Theice (Ll + Pye 


Hence, the unconditional proability of having a query q which uses the set of domains in 
Dg and of complexity C, can be computed as 
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(5.8) (NC ING) yen Py) /NFe? 
) Cg * Ntedg F1)/ Cg 


Note that the number of distinguishable sub-groups in a query group with respect to an 
index set (and hence the cost of indexing utility evaluation) depends on the number of 
domains with non-zero occurrence probability (with respect to the the group) that the index 
set contains: (The adjoinment of domains with zero occurrence probability in the group to 
the index set will not affect the processing of the group.) Therefore,-an alternative to the 
above strategy of a priori deciding the number of groups to have is to limit the number of 
domains with non-zero occurrence. probability in each group. In attempting to add a query 
to one of the existing groups, we can take into consideration both its distance from the 
group and the number of domains with non-zero occurrence probability in the resulting 


group, and create a new group if necessary. 


74. 


CHAPTER 6. , 
SUMMARY AND FUTURE RESEARCH 


The research reported in this thesis has been motivated by the need for intelligent data base 
management systems, to support large integrated data. bases. .We have proposed a 
methodology for the incorporation of  optimiaation..and. self-erganization.. capabilities. into. 
data base management system. Specifically, we suggest the follewing approach: _ 


(1) the development of = accurate cost model. that.clesely. reflects.the: data base 
environment and data base system operatiog. this cost.model:is. to be: used both by-the- 
query processor for-selecting. the most econamic.access.path for a given query and by 
the reorganization component of the system for.the selection of. a.neay-optimal physical 
data base organization for the observed access pattern), 


(2) the monitoring of accesses to the data base that allows the system to build up an 


accurate model of the contents of the data base and the way that the data base is used; 


(3) the application of forecasting techniques to detect and respond to changes in access 
requirements and data characteristics; 


(4) the design of computationally feasible heuristics that select a near-optimal physical 
organization at a reasonable cost. 


We have applied the foregoing steps to the index selection problem and have achieved a 
design for the incorporation of an adaptive index selection capability into a dynamic, single- 
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relation data base environment. In the following sections, we will summarize the novel 


aspects of our approach and suggest possible extensions to it. 


1. Comparison with Previous Work 


Our experimental and heuristic approach to the index selection problem is different in 
many respects from previous studies by Stonebraker {Stonebraker74], King [King74], 
Schkolnick [Schkoinick75], Farley [Farley75], and Held [Held75b]. These other studies have 
either been formal analyses, which have made many simplifying assumptions in order to 
obtain an analytic solution, or else system designs that have been incomplete or unrealistic in 


various ways. 


Our work attempts to go farther than these by utilizing more complete and accurate models 
of cost and access, and by emphasizing important aspects of realistic data base 
environments. Our model of tuple access is realistic in the sense that we take into 
consideration the blocking effect of tuples on secondary. storage devices. Our cast models 
for data base access and maintenance account for such real overheads as the expense of 
index accessing and the cost of maintaining the index as a balanced tree. Our approach of 
minimizing the total processing cost for the upcoming interval, rather than the expected cost 
for a single data base transaction, is flexible enough to account: for the overhead costs of 


index creation, index storage, and application program retransiation. 


We have stressed the importance of accurate usage model acquisition and data characteristic 
estimation in a dynamic environment where access requirements are continually changing. 


Our scheme endeavours to obtain a precise model of data base usage by recording actual 
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query patterns, thereby avoiding the strong and often inaccurate assumption that domain 
specifications in queries are uncorrelated. We alee take into: consideration the facts that 
values of a domain may not be equally used in queries and that they may not be evenly 
distributed among tuples of the relation, by monitoring the aetual selectivities of the domain 
values that are used in queries. On the other hand, we have also made sure that our 
schemes for gathering statistics during the processing of data base transactions have as little 
effect on system performance as: possible. 


We believe it. necessary to apply forecasting techniques to past- observations and predict 
future access requirements and characteristics, in order to capture and respond to the 
dynamic and changing nature of data base usage. In the selection of applicable forecasting 
techniques, we have stressed the importance of minimal storage requirements, simplicity in 


computation, responsiveness.and adaptability. - 


Finally, the size of actual data bases is reflected in. our: concern: for ‘efficient heuristics to 
speed up the index selection process. Our scheme for the grouping of queries allows us to 
reduce the index selection cost and yet preserve the influence of demain: correlation on the 
selection procedure. 


2. Directions for Future Research 


There are numerous optimization opportunities in a complex data base environment. In 
this thesis, we have addressed the optimization issues related to the choice of indices to be 
maintained and the strategy for using these indices in query processing. By way of 


conclusions, we suggest several directions in which our work can be extended. 
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(1) 


(2) 


There are many separate issues that need resolution in the selection of physical 
organization for a general integrated data base, including method of placement of 
records on secondary storage, primary access mechanism, auxiliary access aids, 
clustering parameters etc. Within a single-data base environment, an organizational 
issue that might be profitable to consider in conjunction with the selection of indices is 
the division of the stored representation of the relation into a number of subfites, each 
consisting of subtuples containing only a subset of ‘the fields in the relation. The 
purpose of such an organization is to limit the amount of irrelevant information that 
is accessed, when the qualification and output parts of a query involve only a small 
number of domains in the relation. Previous studies [Kennedy?2, Stocker7$, Hoffer75] 
have considered this file partitioning problem in the absence of auxiliary access aids. 
An adaptive strategy towards the simultaneous selection of indices and ‘file partitions 


might be fruitful. 


Even though our investigations into index selection are in many respects more 
comprehensive than previous studies, we have considered only the environment of a 
single-relation data base accessed through a restricted interface with limited capabilities 
for the selection of data. To fully realize the flexibility of a relational data base, it is 
necessary to consider a multi-relation environment together with’ a high-level non- 
procedural language interface that permits queries with arbitrary interconnection 
betweeri relations in the qualification part-and higtr level operations on the qualified 
data. In such an environment, it is necessary to consider the utility of indices for more 
complicated operations (such as restriction, projection, division, join, etc. [Codd70, 
Palermo72, Smith75, Rothnie75, Pecherer75, Wong76)) and ‘to select indices for all the 


relations in the data base as a whole. The recording of detailed access history will be 
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(3) 


(4) 


(a) 


necessary for optimal index selection in this environment, and the use of heuristics 
should be fruitful in cutting down the search space and for selecting richer index 
structures (such as. combined indices). 


We have proposed that an intelligent data management. system. should build up a 
model of the contents of the data base and the way that it is used. Such information 
can be used for the evaluation of costs of akernative access paths for the processing of 
queries. In additien. to individual. query optimization and global choice. of optimal 
physical organization, a query cost estimator can find yet. another application in large 
integrated data bases. It is all too easy for a naive data base user to.ask a simple-to- 
phrase query that will take a great deal of computational resource and: time ta. answer. 
Frequently, the value of this information to. the requestor will not be commensurate 
with the resources expended to obtain it. If a cost estimator is available at the user 
interface, a user can obtain an estimate of the cost of answering his query and then 
decide to pay the price.and have it answered, or ta.cancel the query. More. work an the 
development of cost models for complex query processing, and schemes for the 
acquisition of the necessary parameters, in order to provide such.a facility. 


We have applied forecasting techniques to the prediction of upcoming access 
requirements and data characteristics. In a truly adaptive. system, higher level adaptive 
mechanisms will also be necessary. Levin [Levin75] has suggested the following 
hierarchy of adaptive mechanisms to be employed in.an uncertain environment. 


a forecasting mechanism that performs prediction of various parameters in. the system 
based on past observations; 
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(b) a parameter adaptive mechanism that for a given forecasting technique chooses the 


best values for the basic parameters of the technique. 


(c) a meta-adaptation mechanism that automatically switches from one forecasting 


technique to another based on their individual performance. | 


The adpative forecasting procedure we have described actually encompasses the first 
two mechanisms. To incorporate the meta-adaptive mechanism for a particular time 
series involves keeping around the entire series (or at. least the.most recent portion) and 
comparing the amount of potent error that would have been resulted from the 
application of each of the f orecasting technique under consideration. The large 
number of parameters that we utilize preclude the application of any meta-adaptation 
mechanism to each of them. On the other hand, a selective application of such a 
mechanism to parameters to which the cost function is most sensitive may be 


appropriate. 


We have assumed that reorganization is to be considered at fixed intervals, the length 
of which are to be determined by the data base administrator. Since the overhead of 
index selection is incurred at each reorganization point, it would be desirable to have 
the system automatically ad just the intervals between reorganization points to suit the 
rate of change in access pattern and the degradation of system performance. More 
fundamentally, an intelliggent adaptive system must assure that the payoff of the 
adaptive mechanisms is commensurate with its overhead costs, and “switch if off” 


when the usage requirements reaches a steady state. 
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PROOF OF EQUATION (3.3) 


Consider m tuptes T;, To, ““", T,, to be placed into n equatty likely stots that are partitioned 
into p blocks of t slots each (n = pet). Let-p(r) be the number of blocks that contain T}, To, 
~~, T,. Define 


(Al. 1) p(0) = 0 
{A1. 2) a(r) = p(r+i) - pr) 


then p(r) and d(r) are random variables, and d(r) takes on values 0 or |. Let f(r) be the 
expected value of p(r), then we have the f ollowing recurrence relation: 


(Ai. 3) £(0) = 0 


(A1.4) f( (rel) - f(r) = E{d(r))} 
= Prob [d(r) 21] 
= © Prob [d(r)=1 | p(r) sk] Prob [p(r) =k] 
k 


k-P nek t 
D > 


Prob ([p(r) =k] 
k=0 nh-f 


k=p 7 a kt 
zs = ( - } Prob {p(r) =k] 
k=0 n-r h- rf 


n 


n-r- 1) 
(A1. 5) f(rel) 2 ————_ f(r) + 
n-T n-r 
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A closed form solution of the above difference equation can be obtained as follows. Let 


(A1. 6) sexn-r 
(Al. 7) r =n - §,then 
i 
(A1. 8) = s f(n-sei) x5 
$=0 
oo eo 
= 2 (s - t) f(n-s) x5 + f n x8 
s=0 $20 


With some manipulations of equation (Al.8) we have 


(AL. 9) Z (s - 1+ 1) £(n-(s-1)) x8 
s=0 


zs 2 (s - 1+ 1) f(n-(s-1)) x>> 
S21 


i) eo 
s 2 s f(n-s) x8 - = t £(n-s) x5 + EF n x8 
s=0 s=0 s=0 


Considering the second equality sign in equation (Al.9), we get 


[++] t+) 
(Al. 10) x2 f(s - 1) £(n-(s-1)) x8°2 + x BE £(n-(s-1)) x57! 
$2] s=1 


+x & £(n-(s-1)) x57} 


ao o 
=x E gs f(n-s) xSt-~t 2 £(n-s) x8 +n E x5 
: $=0 . s=0 s=0 
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Let 
eo 
(A1. 11) F(x) = 2 f(n-s) x5 
sz=0 
ea 
(Al. 12) F'(x) = EZ 5 £(n-s) x5! 
s=0 


then from equation (ALI0) we get 


(Al. 13) x? F'(x) + x F(x) = x F'(x) - t F(x) + : , or 
-x 


t+x 
(Ai. 14) F* (x) - —————— F(x). = - 


xa-» x a - x)? 


Equation (A114) is a linear first order differential equation, and has the following general 


solution 


xt Ga - xt 
(A1. 15) Po) (« Pa ace ie ax) 
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From equations (AI.7) and (A1.8), f(r) = f(n-s) = coefficient of xS in F(x). Letting 


(Al. 16) s=t+k 


= Nn - r, we have 


o (/t+k n ” 
(Al. 17) F(x) =c 3 ( )x +- = x5 
s=t \ t t s=0 
n-r n 
(Al. 18) f(r) =c ( ) + 
t t 


n n 
(Al. 19) e (Meteo 


(Al. 20) cz- 
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Substituting this for c in equation (A119) we have, 


(Al. 21) f(r) =so-- 
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APPENDIX 2 
ANALYSIS OF SORTING COST 


The sorting of pairs of domain values and tuple identifiers forms a key step in the creation 
of an index. For typical file sizes in a data base environment, an external sorting is 
required. The sort merge technique has been extensively studied [+] Ignoring internal 
comparison costs, the cost of a sort merge depends on the number of initial sorted subfiles, 
the merge factor, and the size of the blocks that are read from and written back into 
secondary storage. However, as we have assumed that the page is the fixed unit of storage 
allocation, we will ignore the possibility of improving the disk accessing cost by reading and 
writing blocks larger than one page each. _ i 


Consider the sorting of a file of p pages. Let b be the number of pages in main memory 
available for internal buffering. As a first step of the sorting process, $ sorted subfiles of 
the original file can be formed using s internal sorts. To optimize the subsequent merging 
process, s should be minimized by maximizing the size of each of the sorted subfiles. 
Hence, the size of each sorted subfile should be made equal to the size of the internal 
buffer, i.e. b pages. The cost of this phase of the sort-merge is 2ep page accesses (since the 
sorting of each of the subfiles is done internally without incurring extra page accesses). It is 
possible that the original p pages of the file are only partially occupied, so that the writing 
out of the sorted subfiles will incur less than p page accesses. Let u be the occupancy factor 
(or fraction of storage utilization) of the original file, then the cost of forming the subfites 
is p « (1 + u) since the total length of the sorted subfiles will only be psu pages. It is also 
possible that pu is not a multiple of b, in which case s-1 subfiles of length b, and one with 


length b’ (= p - b « (s - 1)) will be formed. 


Appendix 2 | 86 Analysis of Sorting Cost 


The merge phase consists of repeatedly merging sorted subfiles until a single one is 
obtained. Knuth [+] has shown that merge patterns can be represented as trees, and that the 
merging cost is proportional to the external path length of the corresponding tree. 
Therefore, sorting cost is minimized by choosing a tree with minimum external path length 
(sum of the level numbers of all the external nodes), such as a complete zrary_tree where z is 
as large as allowable by the internal buffer size. Allowing gne page for the buffering of | 
tuples from each subfile that participates in a merge, and one page for the output buffer, z 
will be chosen to be b-l. Given s initial sorted subfiles, (of which the first s-l are of length. 
b, and the last one is.of length b’), the algarithm for carrying out the merging. according to 
a complete z-ary tree pattern can. be described as follows. First add. dummy subf iles (of zero 
length) as necessary to make s = I(modulo (m-1)), to the front of the queue of initial. subfiles, 
then combine subfiles according to a first-in-first-out discipline, at any stage merging the z 
oldest subfiles at the front of the queue into a single file which is placed at the rear. The 
merging process terminates when a single sorted file. is left. The external path length L. for 
a complete z-ary merge tree is [o} 


(A2. 1) L=qs- L@2- s)/(z- 1) - 


where Ss = [p/bd] 


q = [log,s] 
Hence, the paging cast.C, for the merging phase for the.case that pu is a multiple of b is: 


(A2. 2) C, = bab 
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If pu is not a multiple of b, there will be s - 1 subfiles of length b, and one with length b’ (= 


p-b«(s-1)). In this case, the merge-sort cost Cy is: 
(A2. 3) Co =C,-q* (b - b') 


Hence, the merging cost C (s - 1, b’) for s - 1 subfiles of length b and one of length b’ 


merge 


is 


C, if b¢bdt 


[ASTRAHAN?75] 


{[BAYER72] 
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